This is a simple example analysis of data including import from Excel, data structuring and plotting. The data in this case happens to be optical density data over time (replicate growth curves for a microorganism) but the nature of the data matters little to the basics introduced.

Import OD data

library(readxl) # fast excel reader
#library(googlesheets) # fast google spreadsheet reader (not used here but could be useful)
data.raw <- read_excel("example.xlsx", skip = 1)
## Warning in strptime(x, format, tz = tz): unknown timezone 'zone/tz/2018c.
## 1.0/zoneinfo/America/Los_Angeles'

Show the raw data

library(knitr) # the package that renders R markdown and has some good additional functionality
kable(data.raw)
Time sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12
2015-10-05 10:00:00 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
2015-10-05 12:25:00 0.004 0.002 0.002 0.002 0.002 0.001 0.004 0.003 0.003 0.005 0.002 0.003
2015-10-05 14:10:00 0.003 0.005 0.002 0.004 0.003 0.002 0.003 0.003 0.005 0.003 0.004 0.005
2015-10-05 16:27:00 0.003 0.007 0.012 0.006 0.008 0.006 0.006 0.007 0.007 0.006 0.004 0.007
2015-10-05 18:55:00 0.014 0.014 0.015 0.020 0.018 0.017 0.020 0.015 0.016 0.010 0.015 0.016
2015-10-05 20:30:00 0.030 0.031 0.030 0.034 0.026 0.027 0.027 0.022 0.020 0.022 0.024 0.025
2015-10-06 03:10:00 0.034 0.036 0.036 0.064 0.059 0.057 0.104 0.107 0.123 0.136 0.158 0.147
2015-10-06 09:20:00 0.029 0.033 0.031 0.053 0.051 0.051 0.083 0.093 0.109 0.144 0.133 0.139
2015-10-06 11:10:00 0.029 0.029 0.032 0.050 0.050 0.048 0.081 0.090 0.105 0.141 0.131 0.133
2015-10-07 11:00:00 0.026 0.033 0.035 0.051 0.049 0.049 0.076 0.078 0.085 0.125 0.131 0.116

Restructuring the data

Turning the wide format excel data into long format. Note: here we make use of the pipe operator %>%, which just simplifies chaining operations.

library(tidyr) # for restructuring data very easily
data.long <- data.raw %>% gather(sample, OD600, -Time)
# melt <- gather(raw, sample, OD600, -Time) # this would be identical without using %>%

Introducing time in hours.

library(dplyr, warn.conflicts = FALSE) # powerful for doing calculations on data (by group, etc.)
data.long <- data.long %>% mutate(time.hrs = as.numeric(Time - Time[1], units = "hours"))

First plot of all the data

library(ggplot2) # powerful plotting package for aesthetics driven plotting

p1 <- 
  ggplot(data.long) + # initiate plot
  aes(x = time.hrs, y = OD600, color = sample) + # setup aesthetic mappings
  geom_point(size = 5) # add points to plot
print(p1) # output plot

Combining data by adding sample meta information from the spreadsheet’s second tab

data.info <- read_excel("example.xlsx", sheet = "info")

Show all information (these are the experimental conditions for each sample)

kable(data.info)
sample substrate
sample1 background H2
sample2 background H2
sample3 background H2
sample4 +0.5mM Ac
sample5 +0.5mM Ac
sample6 +0.5mM Ac
sample7 +1.25mM Ac
sample8 +1.25mM Ac
sample9 +1.25mM Ac
sample10 +2mM Ac
sample11 +2mM Ac
sample12 +2mM Ac

Combine OD data with sample information.

data.all <- merge(data.long, data.info, by = "sample")

Show us the datas

Reuse same plot using %+% to substitute the original data set with a new one and changing the color to be determined based on the new information we added (but keep everything else about the plot the same).

p1 %+% data.all %+% aes(color = substrate)

Summarize data

To make the figure a little bit easier to navigate, we’re going to summarize the data for each condition (combine the replicates) and replot it with an error band showing the whole range of data points for each condition. We could reuse the plot p1 again, but for clarity are constructing the plot from scratch instead.

data.sum <- data.all %>%
  group_by(time.hrs, substrate) %>%
  summarize(
    OD600.avg = mean(OD600),
    OD600.min = min(OD600),
    OD600.max = max(OD600))
data.sum %>% head() %>% kable() # show the first couple of lines
time.hrs substrate OD600.avg OD600.min OD600.max
0.000000 +0.5mM Ac 0.0010000 0.001 0.001
0.000000 +1.25mM Ac 0.0010000 0.001 0.001
0.000000 +2mM Ac 0.0010000 0.001 0.001
0.000000 background H2 0.0010000 0.001 0.001
2.416667 +0.5mM Ac 0.0016667 0.001 0.002
2.416667 +1.25mM Ac 0.0033333 0.003 0.004
p2 <- ggplot(data.sum) + # initiate plot
  aes(x = time.hrs, y = OD600.avg, ymin = OD600.min, ymax = OD600.max, 
      fill = substrate) + # setup global aesthetic mappings
  geom_ribbon(alpha = 0.3) + # value range (uses ymin and ymax, and fill for color)
  geom_line() + # connect averages (uses y)
  geom_point(shape = 21, size = 5) + # add points for averages (uses y and fill for color)
  theme_bw() + # style plot
  labs(title = "My plot", x = "Time [h]", y = "OD600", color = "Condition") # add labels

print(p2)

Note that we could also have had ggplot do the whole statistical summarising for us using stat_summary but it’s often helpful to have these values separately for other calcluations and purposes.

Now could e.g. focus on a subset of data but reuse same plot using %+% to substitute the original data set with a new one (but keep everythign else about the plot the same).

p2 %+% filter(data.sum, !grepl("background", substrate), time.hrs < 25)

Save this plot automatically as pdf by setting specific plot options in the r code chunk

print(p2)

Interactive plot

Last, you can make simple interactive (javascript) plots out of your original ggplots (plotly does not yet work great for all ggplot features but it’s a start for easy visualization). You can of course construct plotly plots without ggplot for more customization too but that’s for another time.

library(plotly, warn.conflicts = FALSE)
ggplotly(p1)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`